1.
Sci Rep ; 11(1): 1696, 2021 01 18.
Article in English | MEDLINE | ID: mdl-33462256

ABSTRACT

The increased diversity and scale of published biological data has led to a growing appreciation for the applications of machine learning and statistical methodologies to gain new insights. Key to achieving this aim is solving the Relationship Extraction problem, which specifies the semantic interaction between two or more biological entities in a published study. Here, we employed two deep neural network natural language processing (NLP) methods, namely the continuous bag of words (CBOW) and the bi-directional long short-term memory (bi-LSTM). These methods were employed to predict relations between entities that describe protein subcellular localisation in plants. We applied our system to 1700 published Arabidopsis protein subcellular studies from the SUBA manually curated dataset. The system combines pre-processing of full-text articles in a machine-readable format with relevant sentence extraction for downstream NLP analysis. Using the SUBA corpus, the neural network classifier predicted interactions between protein name, subcellular localisation and experimental methodology with average precision, recall, accuracy and F1 scores of 95.1%, 82.8%, 89.3% and 88.4%, respectively (n = 30). Comparable scoring metrics were obtained using the CropPAL database, which stores protein subcellular localisation in crop species, as an independent testing dataset, demonstrating wide applicability of the prediction model. We provide a framework for extracting protein functional features from unstructured text in the literature with high accuracy, improving data dissemination and unlocking the potential of big data text analytics for generating new hypotheses.
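
The four scores reported in this abstract are standard confusion-matrix metrics. As a minimal sketch of how they relate to one another, the Python below computes them from raw counts. The counts are made up for illustration and are not taken from the study, and the names `Confusion` and `scores` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary relation classifier (hypothetical example values)."""
    tp: int  # relations correctly predicted as present
    fp: int  # relations predicted but not actually present
    fn: int  # true relations the classifier missed
    tn: int  # non-relations correctly rejected


def scores(c: Confusion) -> dict:
    """Return precision, recall, accuracy and F1 for one confusion matrix."""
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Illustrative counts only; not data from the paper.
example = scores(Confusion(tp=80, fp=5, fn=20, tn=95))
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Note that the abstract reports averages over 30 runs, so the averaged F1 need not equal the harmonic mean of the averaged precision and recall exactly.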

2.
Adv Exp Med Biol ; 1346: 67-89, 2021.
Article in English | MEDLINE | ID: mdl-35113396

ABSTRACT

In eukaryotic organisms, subcellular protein location is critical in defining protein function and understanding sub-functionalization of gene families. Some proteins have defined locations, whereas others have low specificity targeting and complex accumulation patterns. There is no single approach that can be considered entirely adequate for defining the in vivo location of all proteins. By combining evidence from different approaches, the strengths and weaknesses of different technologies can be estimated, and a location consensus can be built. The Subcellular Location of Proteins in Arabidopsis database ( http://suba.live/ ) combines experimental data sets that have been reported in the literature and is analyzing these data to provide useful tools for biologists to interpret their own data. Foremost among these tools is a consensus classifier (SUBAcon) that computes a proposed location for all proteins based on balancing the experimental evidence and predictions. Further tools analyze sets of proteins to define the abundance of cellular structures. Extending these types of resources to plant crop species has been complex due to polyploidy, gene family expansion and contraction, and the movement of pathways and processes within cells across the plant kingdom. The Crop Proteins of Annotated Location database ( http://crop-pal.org/ ) has developed a range of subcellular location resources including a species-specific voting consensus for 12 plant crop species that offers collated evidence and filters for current crop proteomes akin to SUBA. Comprehensive cross-species comparison of these data shows that the sub-cellular proteomes (subcellulomes) depend only to some degree on phylogenetic relationship and are more conserved in major biosynthesis than in metabolic pathways. Together SUBA and cropPAL created reference subcellulomes for plants as well as species-specific subcellulomes for cross-species data mining. 
These data collections are increasingly used by the research community to provide a subcellular protein location layer, inform models of compartmented cell function and protein-protein interaction networks, guide future molecular crop breeding strategies, or simply answer a specific question: where is my protein of interest inside the cell?


Subjects
Arabidopsis; Arabidopsis/genetics; Databases, Protein; Humans; Phylogeny; Proteomics; Species Specificity; Subcellular Fractions
3.
Plant J ; 104(3): 812-827, 2020 11.
Article in English | MEDLINE | ID: mdl-32780488

ABSTRACT

Agriculture faces increasing demand for yield, higher plant-derived protein content and diversity while facing pressure to achieve sustainability. Although the genomes of many of the important crops have been sequenced, the subcellular locations of most of the encoded proteins remain unknown or are only predicted. Protein subcellular location is crucial in determining protein function and accumulation patterns in plants, and is critical for targeted improvements in yield and resilience. Integrating location data from over 800 studies for 12 major crop species into the cropPAL2020 data collection showed that while >80% of proteins in most species are not localised by experimental data, combining species data or integrating predictions can help bridge gaps at similar accuracy. The collation and integration of over 61 505 experimental localisations and more than 6 million predictions showed that the relative sizes of the protein catalogues located in different subcellular compartments are comparable between crops and Arabidopsis. A comprehensive cross-species comparison showed that between 50% and 80% of the subcellulomes are conserved across species and that conservation only depends to some degree on the phylogenetic relationship of the species. Protein subcellular locations in major biosynthesis pathways are more often conserved than in metabolic pathways. Underlying this conservation is a clear potential for subcellular diversity in protein location between species by means of gene duplication and alternative splicing. Our cropPAL data set and search platform (https://crop-pal.org) provide a comprehensive subcellular proteomics resource to drive compartmentation-based approaches for improving yield, protein composition and resilience in future crop varieties.


Subjects
Crops, Agricultural/metabolism; Databases, Protein; Plant Proteins/metabolism; Cell Compartmentation; Crops, Agricultural/cytology; Plant Breeding; Plant Cells/metabolism; Species Specificity
4.
Mol Plant ; 13(2): 215-230, 2020 02 03.
Article in English | MEDLINE | ID: mdl-31760160

ABSTRACT

The RNA-binding pentatricopeptide repeat (PPR) family comprises hundreds to thousands of genes in most plants, but only a few dozen in algae, indicating massive gene expansions during land plant evolution. The nature and timing of these expansions have not been well defined due to the sparse sequence data available from early-diverging land plant lineages. In this study, we exploit the comprehensive OneKP datasets of over 1000 transcriptomes from diverse plants and algae toward establishing a clear picture of the evolution of this massive gene family, focusing on the proteins typically associated with RNA editing, which show the most spectacular variation in numbers and domain composition across the plant kingdom. We characterize over 2 250 000 PPR motifs in over 400 000 proteins. In lycophytes, polypod ferns, and hornworts, nearly 10% of expressed protein-coding genes encode putative PPR editing factors, whereas they are absent from algae and complex-thalloid liverworts. We show that rather than a single expansion, most land plant lineages with high numbers of editing factors have continued to generate novel sequence diversity. We identify sequence variations that imply functional differences between PPR proteins in seed plants versus non-seed plants and variations we propose to be linked to seed-plant-specific editing co-factors. Finally, using the sequence variations across the datasets, we develop a structural model of the catalytic DYW domain associated with C-to-U editing and identify a clade of unique DYW variants that are strong candidates as U-to-C RNA-editing factors, given their phylogenetic distribution and sequence characteristics.


Subjects
Embryophyta/genetics; Plant Proteins/genetics; RNA Editing/genetics; RNA-Binding Proteins/genetics; Amino Acid Motifs; Databases, Genetic; Embryophyta/classification; Evolution, Molecular; Gene Duplication; Genetic Variation; Models, Molecular; Phylogeny; Plant Proteins/chemistry; Plant Proteins/metabolism; Plants/classification; Plants/genetics; Protein Domains; RNA, Plant/metabolism; RNA-Binding Proteins/chemistry; RNA-Binding Proteins/metabolism; Repetitive Sequences, Amino Acid
5.
Plant J ; 92(6): 1202-1217, 2017 Dec.
Article in English | MEDLINE | ID: mdl-29024340

ABSTRACT

Measuring changes in protein or organelle abundance in the cell is an essential, but challenging aspect of cell biology. Frequently-used methods for determining organelle abundance typically rely on detection of a very few marker proteins, so are unsatisfactory. In silico estimates of protein abundances from publicly available protein spectra can provide useful standard abundance values but contain only data from tissue proteomes, and are not coupled to organelle localization data. A new protein abundance score, the normalized protein abundance scale (NPAS), expands on the number of scored proteins and the scoring accuracy of lower-abundance proteins in Arabidopsis. NPAS was combined with subcellular protein localization data, facilitating quantitative estimations of organelle abundance during routine experimental procedures. A suite of targeted proteomics markers for subcellular compartment markers was developed, enabling independent verification of in silico estimates for relative organelle abundance. Estimation of relative organelle abundance was found to be reproducible and consistent over a range of tissues and growth conditions. In silico abundance estimations and localization data have been combined into an online tool, multiple marker abundance profiling, available in the SUBA4 toolbox (http://suba.live).


Subjects
Arabidopsis Proteins/metabolism; Arabidopsis/metabolism; Proteome; Proteomics; Biomarkers/metabolism; Organelles/metabolism; Protein Transport
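
The idea of coupling per-protein abundance scores to localization data, as described in this abstract, can be sketched as follows. This is only an illustration under stated assumptions: the protein identifiers, scores, and compartment assignments are invented, the function name `relative_compartment_abundance` is hypothetical, and a plain sum-and-normalize step stands in for whatever procedure the SUBA4 tool actually uses.

```python
from collections import defaultdict


def relative_compartment_abundance(abundance: dict, location: dict) -> dict:
    """Sum abundance scores per compartment and normalize to fractions.

    abundance: protein id -> abundance score (hypothetical values)
    location:  protein id -> compartment name (hypothetical assignments)
    Proteins without a known location are skipped.
    """
    totals = defaultdict(float)
    for protein, score in abundance.items():
        compartment = location.get(protein)
        if compartment is None:
            continue  # no localization data for this protein
        totals[compartment] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: total / grand_total for name, total in totals.items()}


# Invented example data; P5 has a score but no location and is ignored.
abundance = {"P1": 30.0, "P2": 10.0, "P3": 50.0, "P4": 10.0, "P5": 99.0}
location = {"P1": "plastid", "P2": "plastid",
            "P3": "mitochondrion", "P4": "cytosol"}

fractions = relative_compartment_abundance(abundance, location)
for name, share in sorted(fractions.items()):
    print(f"{name}: {share:.2f}")
```

Summing over many proteins per compartment, rather than reading off a few markers, reflects the abstract's point that estimates resting on very few marker proteins are unsatisfactory.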
6.
Nucleic Acids Res ; 45(D1): D1064-D1074, 2017 01 04.
Article in English | MEDLINE | ID: mdl-27899614

ABSTRACT

The SUBcellular location database for Arabidopsis proteins (SUBA4, http://suba.live) is a comprehensive collection of manually curated published data sets of large-scale subcellular proteomics, fluorescent protein visualization, protein-protein interaction (PPI) as well as subcellular targeting calls from 22 prediction programs. SUBA4 contains an additional 35 568 localizations totalling more than 60 000 experimental protein location claims as well as 37 new suborganellar localization categories. The experimental PPI data has been expanded to 26 327 PPI pairs including 856 PPI localizations from experimental fluorescent visualizations. The new SUBA4 user interface enables users to choose quickly from the filter categories: 'subcellular location', 'protein properties', 'protein-protein interaction' and 'affiliations' to build complex queries. This allows substantial expansion of search parameters into 80 annotation types comprising 1 150 204 new annotations to study metadata associated with subcellular localization. The 'BLAST' tab contains a sequence alignment tool to enable a sequence fragment from any species to find the closest match in Arabidopsis and retrieve data on subcellular location. Using the location consensus SUBAcon, the SUBA4 toolbox delivers three novel data services allowing interactive analysis of user data to provide relative compartmental protein abundances and proximity relationship analysis of PPI and coexpression partners from a submitted list of Arabidopsis gene identifiers.


Subjects
Arabidopsis Proteins/metabolism; Arabidopsis/metabolism; Computational Biology/methods; Databases, Protein; Protein Interaction Mapping; Protein Interaction Maps; Intracellular Space/metabolism; Molecular Sequence Annotation; Protein Transport; Proteomics; Software; Web Browser
7.
Plant Cell Physiol ; 57(1): e9, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26556651

ABSTRACT

Barley, wheat, rice and maize provide the bulk of human nutrition and have extensive industrial use as agricultural products. The genomes of these crops each contain >40,000 genes encoding proteins; however, the major genome databases for these species lack annotation information of protein subcellular location for >80% of these gene products. We address this gap by constructing the compendium of crop protein subcellular locations called crop Proteins with Annotated Locations (cropPAL). Subcellular location is most commonly determined by fluorescent protein tagging of live cells or mass spectrometry detection in subcellular purifications, but can also be predicted from amino acid sequence or protein expression patterns. The cropPAL database collates 556 previously published studies from >300 research institutes in >30 countries, and compiles eight pre-computed subcellular predictions for all Hordeum vulgare, Triticum aestivum, Oryza sativa and Zea mays protein sequences. The data collection, including metadata for proteins and published studies, can be accessed through a search portal http://crop-PAL.org. The subcellular localization information housed in cropPAL helps to depict plant cells as compartmentalized protein networks that can be investigated for improving crop yield and quality, and developing new biotechnological solutions to agricultural challenges.


Subjects
Databases, Genetic; Genome, Plant/genetics; Hordeum/genetics; Oryza/genetics; Triticum/genetics; Zea mays/genetics; Amino Acid Sequence; Computational Biology; Crops, Agricultural; Hordeum/metabolism; Plant Proteins/genetics; Protein Transport
8.
Bioinformatics ; 30(23): 3356-64, 2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25150248

ABSTRACT

MOTIVATION: Knowing the subcellular location of proteins is critical for understanding their function and developing accurate networks representing eukaryotic biological processes. Many computational tools have been developed to predict proteome-wide subcellular location, and abundant experimental data from green fluorescent protein (GFP) tagging or mass spectrometry (MS) are available in the model plant, Arabidopsis. None of these approaches is error-free, and thus, results are often contradictory. RESULTS: To help unify these multiple data sources, we have developed the SUBcellular Arabidopsis consensus (SUBAcon) algorithm, a naive Bayes classifier that integrates 22 computational prediction algorithms, experimental GFP and MS localizations, protein-protein interaction and co-expression data to derive a consensus call and probability. SUBAcon classifies protein location in Arabidopsis more accurately than single predictors. AVAILABILITY: SUBAcon is a useful tool for recovering proteome-wide subcellular locations of Arabidopsis proteins and is displayed in the SUBA3 database (http://suba.plantenergy.uwa.edu.au). The source code and input data are available through the SUBA3 server (http://suba.plantenergy.uwa.edu.au//SUBAcon.html) and the Arabidopsis SUbproteome REference (ASURE) training set can be accessed using the ASURE web portal (http://suba.plantenergy.uwa.edu.au/ASURE).


Subjects
Algorithms; Arabidopsis Proteins/analysis; Arabidopsis/chemistry; Proteome/analysis; Arabidopsis/genetics; Arabidopsis/metabolism; Arabidopsis Proteins/genetics; Arabidopsis Proteins/metabolism; Bayes Theorem; Databases, Protein; Green Fluorescent Proteins/genetics; Mass Spectrometry; Membrane Proteins/analysis; Protein Interaction Mapping; Proteome/genetics; Proteome/metabolism; Software
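
The general technique named in this abstract, a naive Bayes classifier that merges calls from several evidence sources into one consensus location with a probability, can be sketched as below. This is not the SUBAcon implementation. The source names (`predictor_a`, `gfp`), the toy training set, and the Laplace smoothing constant are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

LOCATIONS = ["plastid", "mitochondrion", "nucleus"]  # toy label set


def train(examples):
    """Count priors and per-source calls from (true_location, calls) pairs."""
    prior_counts = Counter(true_loc for true_loc, _ in examples)
    call_counts = defaultdict(Counter)  # (source, true_loc) -> call counts
    for true_loc, calls in examples:
        for source, call in calls.items():
            call_counts[(source, true_loc)][call] += 1
    return prior_counts, call_counts


def posterior(calls, prior_counts, call_counts, alpha=1.0):
    """Posterior over locations, treating sources as independent given the truth."""
    k = len(LOCATIONS)
    total = sum(prior_counts.values())
    log_scores = {}
    for loc in LOCATIONS:
        # Laplace-smoothed prior, then one smoothed likelihood per source.
        logp = math.log((prior_counts[loc] + alpha) / (total + alpha * k))
        for source, call in calls.items():
            counts = call_counts[(source, loc)]
            n = sum(counts.values())
            logp += math.log((counts[call] + alpha) / (n + alpha * k))
        log_scores[loc] = logp
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    weights = {loc: math.exp(s - top) for loc, s in log_scores.items()}
    z = sum(weights.values())
    return {loc: w / z for loc, w in weights.items()}


# Invented training set: (true location, {evidence source: its call}).
training = [
    ("plastid", {"predictor_a": "plastid", "gfp": "plastid"}),
    ("plastid", {"predictor_a": "mitochondrion", "gfp": "plastid"}),
    ("mitochondrion", {"predictor_a": "mitochondrion", "gfp": "mitochondrion"}),
    ("mitochondrion", {"predictor_a": "mitochondrion", "gfp": "mitochondrion"}),
    ("nucleus", {"predictor_a": "nucleus", "gfp": "nucleus"}),
    ("nucleus", {"predictor_a": "nucleus", "gfp": "nucleus"}),
]
priors, likelihoods = train(training)

# Two sources disagree; the classifier weighs them by their track record.
result = posterior({"predictor_a": "mitochondrion", "gfp": "plastid"},
                   priors, likelihoods)
consensus = max(result, key=result.get)
print(consensus, round(result[consensus], 3))
```

In this toy data the predictor has mislabeled a plastid protein as mitochondrial before, while the GFP evidence has never been wrong, so the contradictory calls resolve toward the plastid. That is the sense in which a probabilistic consensus can outperform any single source.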
9.
J Chem Inf Model ; 52(8): 1917-25, 2012 Aug 27.
Article in English | MEDLINE | ID: mdl-22725641

ABSTRACT

The provision of precise metadata is an important but largely underrated challenge for modern science [Nature 2009, 461, 145]. We describe here a dictionary methods language, dREL, that has been designed to enable complex data relationships to be expressed as formulaic scripts in data dictionaries written in DDLm [Spadaccini and Hall J. Chem. Inf. Model. 2012 doi:10.1021/ci300075z]. dREL describes data relationships in a simple but powerful canonical form that is easy to read and understand and can be executed computationally to evaluate or validate data. The execution of dREL expressions is not a substitute for traditional scientific computation; its purpose is to provide precise data dependency information to domain-specific definitions and a means for cross-validating data. Some scientific fields apply conventional programming languages to methods scripts, but these tend to inhibit both dictionary development and accessibility. dREL removes the programming barrier and encourages the production of the metadata needed for seamless data archiving and exchange in science.


Subjects
Dictionaries as Topic; Informatics/methods; Programming Languages